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Abstract 

Mixtures of linear mixed models (MLMMs) are useful for clustering grouped data 
and can be estimated by likelihood maximization through the EM algorithm. The 
conventional approach to determining a suitable number of components is to compare 
different mixture models using penalized log-likelihood criteria such as BIC. We propose 
fitting MLMMs with variational methods which can perform parameter estimation 
and model selection simultaneously. A variational approximation is described where 
the variational lower bound and parameter updates are in closed form, allowing fast 
evaluation. A new variational greedy algorithm is developed for model selection and 
learning of the mixture components. This approach allows an automatic initialization of 
PLh the algorithm and returns a plausible number of mixture components automatically. In 

cases of weak identifiability of certain model parameters, we use hierarchical centering 
to reparametrize the model and show empirically that there is a gain in efficiency 

+-> 

by variational algorithms similar to that in MCMC algorithms. Related to this, we 
prove that the approximate rate of convergence of variational algorithms by Gaussian 

CN 

J> approximation is equal to that of the corresponding Gibbs sampler which suggests that 

reparametrizations can lead to improved convergence in variational algorithms as well. 
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centering. 



.> 1 Introduction 

X 

Mixtures of linear mixed models (MLMMs) are useful for clustering grouped data in applica- 
tions such as clustering of gene expression profiles (Celeux et al., 2005, and Ng et al., 2006) 
and electrical load series (Coke and Tsao, 2010). We consider MLMMs where the response 
distribution is a normal mixture with the mixture weights varying as a function of the co- 
variates. Our model includes cluster-specific random effects so that observations from the 
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same cluster are correlated. We propose fitting MLMMs with variational methods which can 
perform parameter estimation and model selection simultaneously. Our article makes four 
contributions. First, fast variational methods are developed for MLMMs and a variational 
lower bound is obtained in closed form. Second, a new variational greedy algorithm is de- 
veloped for model selection and learning of the mixture components. This approach handles 
algorithm initialization and returns a plausible number of mixture components automatically. 
Third, we show empirically that there is a gain in efficiency by variational algorithms through 
the use of hierarchical centering reparametrization similar to that in Markov chain Monte 
Carlo (MCMC) algorithms. Fourth, we prove that the approximate rate of convergence of 
the variational algorithm by Gaussian approximation is equal to that of the corresponding 
Gibbs sampler which suggests that reparametrizations can give improved convergence in 
variational algorithms just as in MCMC algorithms. 

In microarray analysis, clustering of gene expression profiles is a valuable exploratory tool 
for the identification of potentially meaningful relationships between genes. In the model- 
based cluster analysis context, Luan and Li (2003) studied the clustering of genes based on 
time course gene expression profiles in the mixture model framework using a mixed-effects 
model with B-splines. Celeux et al. (2005) proposed using MLMMs to account for data 
variability in repeated measurements. Both of these approaches require the independence 
assumption for genes. In contrast, Ng et al. (2006) considered MLMMs which allow genes 
within a cluster to be correlated as the independence assumption may not hold for all pairs 
of genes (McLachlan et al, 2004). Booth et al. (2008) considered a multilevel linear mixed 
model (LMM) which includes cluster-specific random effects and proposed a stochastic search 
algorithm for finding partitions of the data with high posterior probability through maxi- 
mization of an objective function. For the clustering of electrical load series, Coke and Tsao 
(2010) developed random effects mixture models using a hierarchical representation and used 
an antedependence model for the non-stationary random effects. 

MLMMs can be estimated by likelihood maximization through the EM algorithm (Demp- 
ster et al, 1977) and this method was used in Luan and Li (2003), Celeux et al. (2005) and 
Coke and Tsao (2010). Ng et al. (2006) developed a program called EMMIX-WIRE (EM- 
based Mixture analysis With Random Effects) for clustering correlated and replicated data. 
The optimal number of components was determined by comparing different mixture models 
using the BIC (Bayesian information criterion) of Schwarz (1978) in these articles. The EM 
algorithm can be sensitive to initialization and is commonly run from multiple starting values 
to avoid convergence to local optima. Scharl et al. (2010) studied the performance of different 
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initialization strategies for mixtures of regression models. In the context of Gaussian mixture 
models, Biernacki et al. (2003) compared simple initialization strategies and Verbeek et al. 
(2003) discussed a greedy approach to the learning of Gaussian mixtures which resolves the 
sensitivity to initialization and is useful in finding the optimal number of components. 

We propose fitting MLMMs with variational methods using a greedy algorithm. The 
MLMM we consider is a simple generalization of that proposed by Ng et al. (2006) where units 
within each cluster may be correlated. A variational approximation for this model is described 
where the variational lower bound and parameter updates are in closed form, allowing fast 
evaluation. Ormerod and Wand (2010) illustrated the use of variational methods to fit a 
Gaussian LMM and Armagan and Dunson (2011) used variational methods to obtain sparse 
approximate Bayes inference in the analysis of large longitudinal data sets using LMMs. 
Ormerod and Wand (2012) recently introduced an approach called Gaussian variational 
approximation for fitting generalized LMMs where the distributions of random effects vectors 
are approximated by Gaussian distributions. The variational algorithm suffers from problems 
of local optima and initialization strategies for the EM algorithm can often be adapted for 
use with the variational algorithm. A common strategy is to run the variational algorithm 
starting with random initialization from multiple starting points (Bishop and Svensen, 2003). 
Nott et al. (2011) used a short runs strategy similar to that recommended by Biernacki et 
al. (2003) where the variational algorithm is stopped prematurely and only the short run 
with the highest attained value of the variational lower bound is followed to convergence. 

A key advantage of variational methods is the potential for simultaneous parameter es- 
timation and model selection and a number of such methods have been developed for the 
fitting of Gaussian mixtures. Ueda and Ghahramani (2002) proposed a variational Bayesian 
(VB) split and merge EM procedure to optimize an objective function that allows simul- 
taneous estimation of the parameters and the number of components while avoiding local 
optima. They applied this method to a Gaussian mixture and a mixture of experts regression 
where both input and output are treated as random variables. Wu et al. (2012) developed a 
split and eliminate VB algorithm which attempts to split all poorly fitted components at the 
same time and made use of the component-elimination property associated with variational 
approximation so that no merge moves are required. This component-elimination property 
was noted previously by Attias (1999) and Corduneanu and Bishop (2001). McGrory and 
Titterington (2007) described a variational optimization technique where the algorithm is 
initialised with a large number of components and mixture components whose weightings 
become sufficiently small are dropped out as the optimization proceeds. Constantinopoulos 
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and Likas (2007) observed that in this approach, the number of components in the resulting 
mixture can be sensitive to the prior on the precision matrix. They proposed an incremen- 
tal approach where components are added to the mixture following a splitting test where 
a different local precision prior is specified after taking into account characteristics of the 
precision matrix of the component being tested. 

For the examples in this paper, we have attempted the component deletion approach of 
McGrory and Titterington (2007) (results not shown). We observed that this method is more 
effective when the number of components required is not too large as initializing the mixture 
with a large number of components can be computationally expensive especially for large 
data sets. The choice of the initial number of mixture components can have an impact on 
the resulting number of components and it may not be easy to determine a suitable initial 
number. This approach remains sensitive to initialization and methods such as running the 
variational algorithm from multiple starting points are necessary to avoid local optima. 

We develop a new variational greedy algorithm (VGA) for the learning of MLMMs. This 
greedy approach is not limited to MLMMs and may be extended to fit other models using 
variational methods. No additional derivations are required once the basic variational al- 
gorithm is available. Starting with one component, the VGA adds new components to the 
mixture after searching for the optimal way to split components in the current mixture. This 
approach handles algorithm initialization automatically and returns a plausible value for 
the number of mixture components. While this bottom-up approach resolves the difficulty 
of estimating the upper bound of the number of mixture components, it can become time- 
consuming when the number of components is large, since a larger number of components 
have to be tested to find the optimal way of splitting each one. Some measures are intro- 
duced to keep the search time short and the component elimination property of variational 
approximation is used to sieve out components which resist splitting. 

In situations where there is weak identification of certain model parameters and the 
variational algorithm converges very slowly, we apply hierarchical centering (Gelfand et al, 
1995) to reparametrize the MLMM. Hierarchical centering has been applied successfully 
in MCMC algorithms to obtain improved convergence (Chen et al, 2000) and we show 
empirically, that there is a similar gain in efficiency in variational algorithms. We consider 
a case of partial centering, a second case of full centering and derive the corresponding 
variational algorithms. Related to this, we show that the approximate rate of convergence of 
the variational algorithm by Gaussian approximation is equal to that of the corresponding 
Gibbs sampler. Sahu and Roberts (1999) showed that the approximate rate of convergence of 
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the Gibbs sampler by Gaussian approximation is equal to that of the corresponding EM-type 
algorithm and hence improvement strategies for one algorithm can be used for the other. 
As reparametrizations using hierarchical centering can lead to improved convergence in the 
Gibbs sampler, this result suggests that the rate of convergence of variational algorithms may 
be improved through reparametrizations. Papaspiliopoulos et al. (2007) describe centering 
and non-centering methodology as complementary techniques for use in parametrization of 
hierarchical models to construct effective MCMC algorithms. 

In Section 2, we introduce MLMMs. Section 3 describes fast variational approximation 
methods for MLMMs and Section 4 reparametrization of MLMMs through hierarchical cen- 
tering. Section 5 describes the variational greedy algorithm and Section 6 contains theoretical 
results on the rate of convergence of variational algorithms by Gaussian approximation. Sec- 
tion 7 considers examples involving real and simulated data and Section 8 concludes. 

2 Mixtures of linear mixed models 

The MLMM we consider is a generalization of that proposed by Ng et al. (2006), where units 
from the same cluster share cluster-specific random effects and are hence correlated. Unlike 
Ng et al. (2006), our model can fit data where the number of observations on each unit are 
not equal and we allow the mixture weights to vary with covariates between clusters. Suppose 
we observe n multivariate reponses yi = (yn, ...,y in .) T , i — 1, n and N = Ym=i Ui - ^ e t ^ ne 
number of mixture components be k and z^, i — 1, ...,n be latent variables indicating which 
mixture component the ith cluster corresponds to, Zi G {1, k}. Conditional on Zi = j, 

Vi = XSj + W t a t + Vibj + ei (1) 

where X«, Wi and Vi are design matrices of dimensions n« xp, riiX s\ and rii x s 2 respectively, 
/3j, j = 1, k, are p x 1 vectors of fixed effects, Oj, % — 1, n, are si x 1 vectors of random 
effects, bj, j = 1, k, are s 2 x 1 vectors of random effects and % — 1, ...,n, are vectors of 
random errors. We assume that the random effects Oj, % = 1, n, bj, j = 1, k, and the error 
vectors % = l,...,n, are mutually independent. The fixed effects, the distribution of the 
random effects and the distribution of the error terms are all mixture component specific. The 
random effects distribution for Oj and bj are N(0,a^j si ) and N(0,a% I S2 ) respectively. The 
error vector Q is distributed as iV (0, S^) where = blockdiag(cr| 1 / Kil , <jj g I Kig ), a block 
diagonal with the Ith block equal to cr| z J Kii . Here g is constant for all i and Yli=i K u = n i f° r 
each i = 1, n. In microarray experiments for instance, this specification provides increased 
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flexibility as the error variance of each mixture component is allowed to vary between different 
experiments, say, by setting g to be the total number of experiments. We assume that 



" P '^' (2) 
exp« Si) 

where Ui = (un, u ic i) T is a vector of covariates and Sj = (Sji, Sjd) T are vectors of 
unknown parameters j = 2, /c. We set Si = for identifiability. These mixing coefficients, 
which are functions of the covariates, are known as gating functions in the mixture of experts 
terminology (Jacobs et al, 1991). This model for the mixture component indicators allows the 
mixture weights to vary with covariates across clusters. For Bayesian inference on unknown 
parameters we assume the following priors. a\. ~ IG(a aj , X aj ), j = 1, k, where IG(a, A) 
denotes the inverse gamma distribution with shape parameter a and scale parameter A, 
a\. ~ IG(a bj ,X bj ), j = l,...,k, a) x ~ IG{a jU Xji), j = l,...,k, I = l,...,g, S = (6%, -,Sl) T ~ 



iV(0,S 5 ) and fa ~ N(0,E Pj ). Here a a ., X aj , a b ., X bj , a jh X jh E 5 and j = l,...,k, 
I = 1, g, are hyperparameters considered known. 



3 Variational approximation 

Variational methods originated in statistical physics and research into these approaches is 
currently very active in both statistics and machine learning. Until recently, variational 
approximation methods have mostly been developed in the machine learning community 
(Jordan et al, 1999, Winn and Bishop, 2005). See Ormerod and Wand (2010) for an expla- 
nation of variational approximation methods and the introduction for further references on 
application of variational methods to mixture models specifically. We consider a variational 
approximation to the joint posterior distribution of all the parameters 9 of the form q(9\X) 
where A is the set of variational parameters to be chosen. Here a parametric form is chosen 
for q(9\X) and we attempt to make q(9\X) a good approximation to p(9\y) by minimizing the 
Kullback-Leibler (KL) divergence between q(9\X) and p(9\y), i.e., 

where p(y) = f p(y\9)p{9) d9 is the marginal likelihood. As the KL divergence is positive, 
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which gives a lower bound on the log marginal likelihood, and maximization of this lower 
bound is equivalent to minimisation of the KL divergence between the posterior distribution 
and variational approximation. This lower bound is sometimes used as an approximation to 
the log marginal likelihood for Bayesian model selection purposes. 

Write = F, a = (aj , alf , b = (&f,...,&D T , *S = (*2i,-,<&) r , = 

«,-,<) T , °) = K o] g ) T , 3 = l,-,k, o* = (af,...,aff, 5 = ...,tf) r and 

z = (zi,...,z n ) T so that 9 = ((3 T , a T , b T , a\ , of ,a 2 ,5 T ,z T ) T . For convenience we write 
q(9\X) as q(9), suppressing dependence on A and consider a variational approximation of the 
form q(9) = q(f3)q(a)q(b)q(al)q(a 2 b )q(a 2 )q(5)q(z), where 

k n k n 

j'=l i=l j=l i=l 

k k k Q 

j=i j=i j=i «=i 

and is A^.,^.), j = 1, g(a,) is iV«,X^), i = l,...,n, #,) is A^.,E«.), 

j = 1,. ..,£;, is /G«.,A«.), j = 1, ...,£;, g(<rg.) is IG(a q bj , X 9 bj ), j = l,...,k, g(<rj) is 

IG(pt q ^ A'j), for j = 1, k, I = 1, g, q(5) is a delta function placing a point mass of 1 on 

and = j) = ^ where X^=i 9»j = 1, = 1, We are assuming in the variational 
posterior that parameters for different expert components are independent and independent 
of all other parameters. 

For a variational posterior restricted to be of the factorized form q(9) = Y\T=i <?(^i)> the 
optimal q(9i) minimizing the KL divergence is given by q(9i) oc exp{E_ e . log p(y, 9)} (see, 
for example, Ormerod and Wand, 2010). In our case, the specific distributional forms for the 
variational posterior densities, such as the assumption of a degenerate point mass variational 
posterior for 5, have been chosen to make computation of the lower bound tractable even 
though they might not be optimal. It is also possible to consider the fixed effects /3, and the 
random effects a and b as a single block and replace q(f3)q(a)q(b) by q((3, a, b) as in Ormerod 
and Wand (2010). This results in a less restricted factorization with dependence structure 
between (3, a and b preserved and a higher lower bound can be achieved. However, this will 
involve dealing with high dimensional sparse covariance matrices which creates a greater 
computational burden although it is possible to use matrix inversion results for the blocked 
matrices to attain better computational efficiency (referee's suggestion). We have decided 
to use a factorized form for faster computation and better scalability to larger data sets 
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(see Armagan and Dunson, 2011). The independence and distributional assumptions made 
in variational approximations may not be realistic and it has been shown in the context of 
Gaussian mixture models that VB, which assumes a factorized posterior, has a tendency to 
underestimate the posterior variance (Wang and Titterington, 2005). However, variational 
approximation can often lead to good point estimates, reasonable estimates of marginal 
posterior distributions and excellent predictive inferences. Blei and Jordan (2006) showed 
that predictive distributions based on variational approximations to the posterior were very 
similar to that of MCMC for Dirichlet process mixture models. Braun and McAuliffe (2010) 
reported similar findings in large-scale models of discrete choice although they observed that 
the variational posterior is more concentrated around the mode than the MCMC posterior, 
a familiar underdispersion effect noted above. Similar independence assumptions have been 
made in the case of the LMM by Armagan and Dunson (2011). 

Now, we want to maximise the variational lower bound L = / log P ^^ 6 \ {9) d0 with 
respect to the parameters A in our variational posterior approximation. The lower bound L 
can be computed in closed form, and is given by (details in supplementary materials) 

it fog - tr(E- S y - 4, T T$4, + log |E«.| - f („« X + tr(E ?j ) 

j=l L b i 

+ log at + log fK) ~ ~a« ~2 log(A ^ } + ^ + "*> log ai; + log fK) 
- l2 7 lgu w<) - MAy) + v«)K - <.) - -j^f- + < } + 2 E lo s 

+ EE log ^ + log ^ + - log(AS)) + iK*%)( ajl - aj) 

j=i 1=1 I ^ i l > 

J j=l jr' = l J 

f p i Pii n / «\ , ^(P + S2) +ns x - iVlog(27r) 
+ Z^2^^ log ^ + logp(/i") + ~ (3) 



i=i i=i 



where T(-) and tp(-) denote the gamma and digamma functions respectively, Pij is evaluated 
by setting 5 = fj, q s , p(nf) denotes the prior distribution for 5 evaluated at fi q 5 , ^ = t/i — Xi/ip. — 

Wml ~ V ^ = blockdiag J/, .,) and Ay = X,E* Xf + WiE^Wf + 

V^£u\y^ . The variational parameters to be optimized consist of /i^ , E^ , /i£ , E£ , a^, A^., 
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«V A V for J = i'-'fc' s a,> for « = aj, Aj„ for j = l,...,k, I = 1, ft, 

for i = l,...,n, j = 1,...,A;, and (/,$. We optimize the lower bound with respect to each of 
these sets of parameters with the others held fixed in a gradient ascent algorithm. All updates 
except for n\ are available in closed form and can be derived using vector differential calculus 
(see Wand, 2002). 

Algorithm 1: 

a q 

Initialize: for i — 1, n, j = 1, k, -A- for j = 1, k, I — 1, g, \x q a . for i — 1, n, 

A*'., a?" an( i ^T" for j = 1, fc. Do until the change in the lower bound between iterations 
is less than a tolerance: 

3. ^ <- (E- =1 ^^/ S1 +wf (E^i^-s?- 1 )^)" 1 . 
6- A 3 <- ^ EIU ?«vf sj ■■-'(w - - 

7. «2>« flj + fEL*r 

8. A^ <- X aj + | E"=i %'{< T ^ S + tr(S^)}. 

9. a* <- a bj + f . 

10. A^A 6j + H^X+tr(E^}. 

11. a* <- ttj, + !£r=i9«««- 

12. AJ <- Ajj + | Er=i + tr(A^) Ki J 

where ((^) Kil , (&j) Kj9 ) is the partition of ^ corresponding to the (kh, re i9 ) and 
(Aij) Kil is the diagonal block of A^- with rows and columns corresponding to the position 
of k u within (/c a , re ig ). 



9 



13. Set pig to be the conditional mode of the lower bound fixing other variational parameters 
at their current values. As a function of pig, the lower bound is the log posterior for a 
Bayesian multinomial regression with the ith response being (q a , q ik ) T and a normal 
prior on pig. The usual iteratively weighted least squares algorithm (or other numerical 
optimization algorithm) can be used for finding the mode. 

14 qij <" Tl^^L ' Where ° ij = a WK) - M^)} ~ Sty a ^l + tr(S^)} + 

\ £2=1 «T{V(oj,) - lo g(^)> - im^ -1 ^) + ^r 1 ^}- 



In the examples, when Algorithm 1 is used in conjunction with the VGA described in Section 

otl- a b- a q 

5 to fit a 1-component mixture, for j = 1, we set -^r = yf = 1, -rw- = 1 for / = 1, o, 
pi\. = 0, pfl a . = for i — 1, ...,n, and g^- = 1 for i = 1, ...,n for initialization. 

The variational posterior for 5 has been assumed to be a degenerate point mass to make 
computation of the lower bound tractable. However, at convergence, we relax the form of q(5) 
to be a normal distribution. Suppose q(5) is not subjected to any distributional restriction, 
the optimal choice for this term is given by 



q{6) oc exp < ^ QH lo £Pij ~ \ 5T ^& 15 \ where Pa 
{ i=i j=i ~ J 



expjufSj) 
J2i exp(uj5i) 



(4) 



If pig is close to the mode, we can get a normal approximation to q(5) by setting pig as the 
mean and the covariance matrix T> q s as the negative inverse Hessian of the log of (4) which is 
the Bayesian multinomial log posterior considered in step 13 of Algorithm 1. Waterhouse et 
al. (1996) outlined a similar idea which they used at every step of their iterative algorithm. 
We recommend first using a delta function approximation in the VGA and doing a one-step 
approximation after the algorithm has converged (see Nott et al, 2011). Using the normal 
approximation N ([/,$, Tig) as the variational posterior for q(5), the variational lower bound is 
the same as in (3) except that £™ =1 Y^=i % ^°&Pij + l gp(/-*<s) is replaced with 

^ L Q*E q (log ^ eMuJ§i) j + - log [E, E«|--^ E 5 ^--tr(E, E«) + . 

The expectation of the first term, E q {log j ^Tip(Jri;) }) > * s no ^ available in closed form and 
we replace it with log < ==. -, T \ , > where pig. is the subvector of pig corresponding to Sj, 



E;cxp(«f^ ; ) 

j = 2, fc, to obtain an approximation to log p(y). This approximation to the log marginal 
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likelihood is later used in the VGA as a model selection criterion. 

4 Hierarchical Centering 

Gelfand et al. (1995) discussed how reparametrizations of normal LMMs using hierarchical 
centering can improve convergence in MCMC algorithms. In our later examples we encounter 
situations where there is weak identification of certain model parameters and Algorithm 1 
converges very slowly. We apply hierarchical centering and show empirically that there is a 
gain in efficiency in variational algorithms through hierarchical centering, similar to that in 
MCMC algorithms. In Section 6 we give some theoretical support for this observation. 

We consider a case of partially centered parametrization in which = W{ and a second 
case of fully centered parametrization in which X; L = Wi = Vi in (1). In the first case, we 
introduce rji = (3j + a>i conditional on Zi = j so that (1) is reparametrized as 

yi = Xtfi + Vibj + et 

and rji is 'centered' about (3j, with rji ~ N(pj,a 2 .I p ). Writing 77 = (rjj , ...,ri^) T , the set of 
unknown parameters is now 9 = (/3 T \r] T \b T \ a 27 \crf r ', a 2T , 5 T , z T ) T . We replace q(a) in the 
variational approximation with q(r)) = YYi=i QiVi) > where q(r)i) is N(fj^.,H^.), for i — 1, ...,n. 
In the second case of full centering, we introduce = Vj + a« and Vj = j3j + bj, conditional 
on Zi = j so that (1) is reparametrized as 

Ui = XiPi + €i, 

with pi 'centered' about Vj and Vj 'centered' about /3j. We have pi ~ N(i/j,a 2 J p ) and 
Uj ~ N((3j,<T%.I p ). Writing v = (vf , v^) T and p = (pj , Pn) T , the set of unknown pa- 
rameters is 9 = (/3 T , z/ T , p T , a^ T , a% T ', a 2T ' , 5 T , z T ) T . We replace q(a) and q(b) in the variational 
approximation with q(p) = YYi=i Q(Pi) an d Q{ v ) = ¥[^=1 Qi^j) , where q(pi) is N(p J q p .,Y:'j ) .) for 
% — 1, n, and q{i/j) is N(pl., E£ ) for j = 1, fc. The variational algorithms with partial 
centering and full centering reparametrizations are known as 'Algorithm 2' and 'Algorithm 
3' respectively. The variational lower bounds and parameter updates can be computed as 
before and can be found in the supplementary materials. The variational posterior for 5 can 
be relaxed to be a normal distribution at convergence and similar adjustments (discussed in 
Section 3) apply to the variational lower bounds for Algorithms 2 and 3. 
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5 Variational Greedy Algorithm 



In the greedy algorithm, VA refers to Variational Algorithm which can be Algorithm 1, 2 or 
3. Let fk denote the /c-component mixture model and Ck the set of k components that form 
the mixture model The greedy learning procedure can be outlined as follows. 

1. Compute the one-component mixture model f\ using VA. 

2. Find the optimal way to split each of the components in the current mixture fk- This 
is done in the following manner. For each component Cj* G Ck, form Aj* = {i G 
{1, n) : j* = argmaxi<j<fe g^}, where {qij, l<i<n,l<j<k} are the variational 
posterior probabilities of f k . For m — 1, M, 

• randomly partition A,* into two disjoint subsets Aji* and Aj 2 * and form a (k + 
l)-component mixture by splitting the variational posterior probabilities of Cj* 
according to A,i* and Aj2*. That is, we create two subcomponents Cj\* and Cj2* 
such that for Cji*, is equal to the variational posterior probabilities of Cj* in 
fk if the ith observation lies in Aji* and zero otherwise, / = 1,2. The variational 
parameters of Cj\* and Cj2* required for initialization of the VA are set as equal 
to that of Cj*. 

• Variational parameters of all other components are set as those in fk- In the 
application of the VA, we do not update the variational parameters of components 
in Ck — Cj* as we are only interested in learning the optimal way of splitting Cj* . 
Hence, we apply only a 'partial' VA during this search. 

For each component Cj* G Ck, choose the run with the highest attained lower bound 
among M runs as that yielding the optimal way of splitting Cj* . Let Lj* and denote 
the lower bound and {k + l)-component mixture model respectively corresponding to 
the optimal way of splitting Cj* . 

3. The components in Ck are sorted in descending order according to Lj* and then split in 
this order. After the Zth split, the total number of components in the mixture is k + I. 
Let f t k™ p denote the mixture model obtained after / splits. Suppose that at the (l + l)th 
split, the component in Ck being split is Cj*. We apply a 'partial' VA again, keeping 
fixed variational parameters of components awaiting to be split. For the initialization, 
we set the variational parameters of Cji* and Cj2* to be equal to those in and 
the remaining variational parameters to be equal to those in f k e ^ p if / > 1 and fj* ht if 
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I — 1. A split is considered 'successful' if the estimated log marginal likelihood increases 
after the split. This process of splitting components is terminated once we encounter 
an unsuccessful split. 

4. If the total number of successful splits in step 3 is s, then a (k + s)-component model 
fk+f is obtained at the end of step 3. We apply VA on f^™f until convergence updating 
all variational parameters this time to obtain mixture model fk+s- 

5. Repeat steps 2-3 until all splits of the current mixture model are unsuccessful. 

We have experimented with several dissimilarity measures based on Euclidean distance 
as well as variability- weighted similarity measures (Yeung et al, 2003) in the case of repeated 
data to partition Aj* in step 2. Generally, VGA performed better when a random partition 
was used. Methods such as /c-means clustering are also difficult to apply when there is missing 
data. The partitioning of Aj* into two disjoint subsets in step 2 serves only as an initialization 
to the 'partial' VA to be carried out in search of the optimal way to split component Cj*. 
If we consider an outright partitioning of the data by assigning observation % to the j*th 
component if j* = arg maxi where < i < n, 1 < j < k} are the variational 

posterior probabilities, it is still possible for observations originally from different components 
to be placed in the same component again in steps 3 and 4. This is due to the updating of 
the variational posterior probabilities qij of all components which have been split in step 3 
and that of all existing components in step 4. 

The amount of computation is greatly reduced by the use of a 'partial' VA as the algorithm 
converges quickly when the variational parameters of all other components (except for the two 
subcomponents arising from the component being split) are fixed. As we are using only the 
run with the highest attained lower bound out of M runs, it is not computationally efficient 
to continue every run to full convergence and we suggest using 'short runs' in this search step. 
In later examples, we terminate each of these M runs once the increment in the lower bound 
is less than 1. Suppose we are trying to split a component Cj* into two subcomponents 
Cji* and Cj2*- After applying 'partial' VA, the variational posterior probabilities of one of 
the two subcomponents sometimes reduce to zero for all of % — l,...,n, so that the two 
subcomponents effectively reduced to one. When this happens on the attempt leading to the 
highest variational lower bound among all M attempts to split Cj*, we suggest omitting Cj* 
in future splitting tests. This reduces the number of components we need to test for splitting 
and can be very useful when the number of components grows to a large number. For the 
examples discussed in this paper, we set M to be 5 and the variational algorithm is deemed 
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to have converged when the absolute relative change in the lower bound is less than 1CT 5 . 
We note that the number of mixture components returned by the VGA may vary due to 
the random partitions in step 2 although the variation is relatively small compared to the 
number of clusters returned. The biggest advantage of the VGA is that it performs parameter 
estimation and model selection simultaneously and automatically returns a plausible number 
of components. It is possible however for the VGA to overestimate the number of components 
and some optional merge moves may be carried out after the VGA has converged if the user 
finds certain clusters to be very similar. This can be done quickly using a partial ' VA' in which 
the variational parameters of all other components except the two to be merged are fixed. A 
merge move is considered 'successful' if the estimated log marginal likelihood increases when 
two components are merged. While the VGA has been applied repeatedly in the examples for 
the purpose of analysing its performance, the user need only apply it once and may consider 
some merge moves if he finds clusters which are very similar. If multiple applications are used, 
we suggest using the estimated log marginal likelihood as a guideline to select the clustering 
solution. While reparametrizations using hierarchical centering increases the efficiency of the 
VGA, we have not observed that the number of components returned differs significantly 
due to the reparametrization. 



6 Rate of convergence of variational approximation 

In this section, we show that the approximate rate of convergence of the variational al- 
gorithm by Gaussian approximation is equal to that of the corresponding Gibbs sampler. 
As reparametrizations using hierarchical centering can lead to improved convergence in the 
Gibbs sampler, this result lends insight into how such reparametrizations can increase the 
efficiency of variational algorithms in the context of MLMMs. This is because the joint pos- 
terior of the fixed and random effects in a LMM is Gaussian (with Gaussian priors and 
Gaussian random effects distributions) when the variance parameters are known. 

Let the complete data be Y aug = (Y obs , Y mis ) where Y obs is the observed data and Y mis is the 
missing data. Let the complete data likelihood be p(Y aug \9) where 9 is a p x 1 vector and Y mis 
rx 1. Suppose the prior for 9 is p(9) oc 1 and the target distribution is p(9, Y mis ) — N (( ^ ) , S) 
where S = ( I 11 I 12 ). Let H = = ( £ n ) . It can be shown that 

p(Y mis \9, Y obs ) = N(fi 2 - H^Hnie - Aii), H£) and 
p{9\ Y mis , Y obs ) = N( f i 1 - H^H 12 {Y mis - /i 2 ), H^) . 
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Sahu and Roberts (1999) showed that under such conditions, the rate of convergence of the 
EM algorithm alternating between the two components 9 and Y mis is equal to the rate of 
convergence of the corresponding two-block Gibbs sampler. This rate is given by p(B EM ), 
where B EM = H^H^H^H^ and p(.) denotes the spectral radius of a matrix. 

In the variational approach, we seek an approximation q(9,Y mis ) to the true posterior 
p(9, Y mis \Y obs ) for which the KL divergence between q and p(9, Y mis \Y obs ) is minimized subject 
to the restriction that q{9,Y mis ) can be factorized as q{9)q{Y rnis ) . The optimal densities are 

q(Y mis ) = N (p 2 - H 22 H 21 (p\ - iii), H 22 ) and 
q{9) =N{ f i 1 - H u l H l2 {& mis - p 2 ), tffi 1 ) , 

where p q 6 and p\r . denote the mean of q{9) and q(Y mis ) respectively. Starting with some ini- 
tial estimate for p q &) we can iteratively update the parameters p q e and p^y mis until convergence. 
Let ul^ and /iy*' denote the tth iterates. It can be shown that 

4^ = H 22 l H 2l H^H 12 42 is + {h ~ H£H 21 H^H l2 ) ^ and 
fl f+^ = B EM f4^ + (l p -B EM ) f i 1 . 

The matrix rate of convergence of an iterative algorithm for which = M{9^) and 9* 

is the limit is given by DM(9*) where DM{9) = (^f 1 )- A measure of the actual observed 
rate of convergence is given by the largest eigenvalue of DM{9*) (Meng, 1994). The rate 
of convergence of p q e is therefore p(B EM ). Since H 22 H 21 H^H 12 and B EM share the same 
eigenvalues, the rate of convergence of p Y . is also p(B EM ). The overall rate of convergence 
of the variational algorithm is thus p(B EM ). 

Suppose we impose a tougher restriction on q(9, Y m i s ). For a partition of 9 into m groups 
such that 9 = {9\, 9, m ) with 0j a x 1 vector and ^ ri = p, we assume that q(9, Y mis ) can 
be factorised as YliLi Q(8i)q(Y mis ). The optimal density of q(Y mis ) remains unchanged. Let 
Pi = (/in, — , Pim) and 

(An A12 ... Aim 
A21 A 22 ... A 2m 
.... 
A m i A m 2 ••■ A mm 

be partitioned according to 9 = (9 1: ...,9 m ). The optimal density of q(9i) is 

N{t*ii - A„ 'X! A u(/'^ " Vij) ~ K^H l2 {p q Ymis - p 2 ),A^) for i = 1, ...,m. 
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This leads to the following iterative scheme. After initializing i = l,...,m., we cycle 
though updates: 

• ^Ymis <~ V"i ~ H 22 H 2l (n q - Hi) 



till convergence. Consider the (£ + iteration. For notational simplicity, we replace (/jif — 



„ M ) by Af , („f - ^) by Af and - „ 2 ) by A™. Since Ag 



we have 



/An 

A21 A22 



\ 



9(*+l) 
02 



\A m i A m 2 ... A mm J 



A 



/0 A 



+ 



12 



A lm \ 



... A 



2m 



82 



\0 ... / 



A 



= H U B* M \ 



EM\q(t) 



Let L be the lower triangular matrix of ifn and U = L — Hu. Then 

L\f +l) - U\f> = H 11 B EM \f ) 

& xf +l) = L^UXf + L-\L - U)B EM X q{t) 
& X q{t+l) = [B aug + (I p - B^B^jxf 

where B aug = L~ l U . Therefore the rate of convergence of X q g and hence, that of [i\ is p(B aug + 

-, the rate 



#(*+!)- 



(I p — B aug )B EM ). As the rate of convergence r, is defined as r = lim^oo ^ , 
of convergence of Ay and hence n Y . is given by 



lim 

t— >oo 





1 mis 


xf • 

1 mis 


-X'y* 

1 mis 



lim 

t— >oo 



'H 22 H 2 iX q J" ■* + H 22 H 2 iXg 



-H 22 l H 21 X q } t ^ + H 22 1 H 2 iX q 6 



= lim 



xf - 




\q(t-l) 





which is equal to the rate of convergence of n q e . The overall rate of convergence of the varia- 
tional algorithm is thus p(B aug + (I p — B aug )B EM ) which is equal to the rate of convergence of 
the Gibbs sampler that sequentially updates components of 9, and then block updates Y mis 
derived by Sahu and Roberts (1999). Although the theory developed may not be directly 
applicable to LMMs with unknown variance components as well as MLMMs in general, it 
suggests to consider hierarchical centering in the context of variational algorithms and our 
examples show that there is some gain in efficiency due to the reparametrizations. 
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7 Examples 



To illustrate the methods proposed, we apply VGA using Algorithms 1, 2 and 3 on three 
real data sets (application of Algorithm 2 on yeast galactose data set can be found in sup- 
plementary materials). We also consider simulated data sets in Section 7.3 where VGA is 
compared with EMMIX-WIRE (Ng et al, 2006). In Section 7.2, we report the gain in ef- 
ficiency from reparametrization of the model using hierarchical centering. In the examples 
below, an outright partitioning of the data is obtained by assigning observation i to the j*th 
component if j* = argmaxx^xfc q^, where {^,1 < i < n, 1 < j < k} are the variational 
posterior probabilities of the mixture model obtained using VGA. 

7.1 Clustering of time course data 

Using DNA microarrays and samples from yeast cultures synchronized by three independent 
methods, Spellman et al. (1998) identified 800 genes that meet an objective minimum crite- 
rion for cell cycle regulation. We consider the 18 a-factor synchronization where the yeast 
cells were sampled at 7 min intervals for 119 mins and a subset of 612 genes that have no 
missing gene expression data across all 18 time points. This data set was previously analyzed 
by Luan and Li (2003) and Ng et al. (2006) and is available online from the yeast cell cycle 
analysis project at http:/ /genome- www.stanford.edu/cellcycle/. Our aim is to obtain an op- 
timal clustering of these genes using the VGA. Following Ng et al. (2006), we take n = 612 
genes, Wi = li 8 , V* = i"i 8 , w« = 1 and X { to be an 18 x 2 matrix with the (/ + l)th row 
(I = 0, 17) as (cos(27r(7/)/a;), sin(27r(7Z)/a;), where u = 53 is the period of the cell cycle 
for i — 1, ...,n. For the error terms, we take g — 1 and Kn = 18 for i — 1, ...,n so that the 
error variance of each mixture component is constant across the 18 time points. We used the 
following priors, 5 ~ N(0, 10007), ^ ~ N(0, 10007) for j = l,...,k, and 7(7(0.01,0.01) for 

CT v J = and a ji> j = i,---^ 1 = 1 ,-,9- 

Applying the VGA using Algorithm 1 ten times, we obtained a 15-component mixture 
once, a 16-component mixture six times and a 17-component mixture thrice. The mode is 16 
and we report the clustering for a 16-component mixture obtained from the VGA in Figure 
1. For this clustering, we attempted several merge moves on clusters which appear similar 
such as 3 with 4, 5 with 6, 7 with 8 and 13 with 14. These merge moves did not result in 
a higher estimated log marginal likelihood. However, we observed that cluster 2 (48 genes) 
was split into two clusters in one of the 17-component mixture models and these two clusters 
can be merged successfully with a higher estimated log marginal likelihood being obtained. 
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Figure 1: Clustering results for time course data obtained from applying the VGA using 
Algorithm 1. The x-axis are the time points and y-axis are the gene expression levels. Line 
in grey is the posterior mean of the fixed effects given by X^^.. 
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Figure 2: Clustering results for water temperature data. The x-axis is the depth and y-axis 
is the water temperature. 

Thus, it is possible for the VGA to overestimate the number of mixture components and 
merge moves can be considered when similar clusters are encountered. We note however that 
the variation in the number of mixture components returned by the VGA is relatively small. 
For this data set, the number of clusters returned by VGA was generally larger than that 
obtained by Ng et al. (2006) where BIC was used for model selection and the optimal number 
of clusters was reported as 12. Any interpretation of the differences in results would need to 
be pursued with the help of subject matter experts, but our later simulation studies tend to 
indicate that BIC underestimates the true model so that possibly our clustering is preferable 
from this point of view. Of course it may be argued that the ability to estimate the 'true 
model' is not a chief concern in clustering applications where interpretability of the results 
in the substantive scientific context is the primary motivation. 

7.2 Clustering of water temperature data 

We consider the daily average water temperature readings during the period 9 September 
2010-10 August 2011 collected at a monitoring station at Upper Peirce Reservoir, Singapore. 
No data were available during the periods 23 December 2010-28 December 2010, 10 February 
2010-23 February 2010 and 14 April 2011-10 May 2011. Readings were collected at eleven 
depths from the water surface; 0.5m, 2m, 4m, 6m, 8m, 10m, 12m, 14m, 16m, 18m and 
at the bottom. Using data from the remaining 290 days, we apply the VGA to obtain a 
clustering of this data. We take n = 290, rii = 11 and X, = Wi = Vi = In for % — 1, ...,n. 
We set g — 11 with Ku — 1 for i — 1, ...,n, / = 1, ...,g so that the error variance of each 
mixture component is allowed to be different at different depths. For the mixture weights, 
we set Ui = (l,i,i 2 ,i 3 ), i — 1, ...,n, and subsequently standardize columns 2-4 in the matrix 
U = (uf , u^) T to take values between -1 and 1, centered at 0. We used the following 
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Figure 3: Fitted probabilities by gating function for clusters 1 to 6. The x-axis are days 
numbered 1 to 290 and y-axis are the probabilities. 

priors, 5 ~ N(0, 10007), ~ N(0, 100007) for j = 1, and 7(3(0.01,0.01) for o\., o\., 
j = l,...,k and cr^, j = l,...,k, I = l,...,g. Applying VGA with Algorithm 3 five times, 
we obtained a 6-component model each time with very similar results. The clustering of a 
6-component fitted model is shown in Figure 2 and the fitted probabilities from the gating 
function are shown in Figure 3. For comparison, we apply VGA with Algorithm 1 five times. 
A 6-component mixture model was obtained on all five attempts. The average CPU time 
taken to fit a 6-component model using VGA with Algorithm 1 was 2114 seconds compared 
to 932 seconds by Algorithm 3. In this example, hierarchical centering reparametrization has 
helped to improve the rate of convergence with the computation time reduced by more than 
half. The Upper Peirce Reservoir uses aeration devices intended to mix the water at different 
depths, with the aim of controlling outbreaks of phytoplankton and algal scums. On days 
when these aeration devices are operational, it is expected that there will be less stratification 
of the temperature with depth. Accurate records of the operation of the aeration devices were 
not available to us and there is some interest in seeing whether the clusters divide into more 
or less stratified components giving some insight into when the aeration devices were used. 

7.3 Simulation study 

We report results from a simulation study in which VGA is compared with EMMIX-WIRE 
developed by Ng et al. (2006). EMMIX-WIRE fits MLMMs by likelihood maximization 
using the EM algorithm and is able to handle the clustering of correlated data that may be 
replicated. We compare the performance of EMMIX-WIRE with VGA using 10 data sets 
simulated from model (1). Each data set consist of n — 499 vectors of dimension = 18 
and each contain 12 clusters of sizes 43, 48, 85, 49, 65, 77, 8, 21, 18, 15, 34 and 36. These 
clusters are based on the 16-component mixture model in Figure 1 fitted to the time course 
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11 
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12 


0.922 



Table 1: Simulation results comparing EMMIX-WIRE with VGA. 

data in Section 7.1 from which 12 distinctive clusters have been selected. In particular, we 
have left out clusters 6, 8, 14 and 16. The values of the unknown parameters (3j, a\ , of , 
j = 1, and o 2 - h j = 1, / = l,...,g in model (1) were taken to be equal to the 
variational posterior mean values of the 16-component mixture in Figure 1 and g — 1. The 
design matrices JQ, Wi and V$, i — 1, ...,n, are as described in Section 7.1 and we used the 
same priors as before. 

For each of the 10 data sets, we ran EMMIX-WIRE with the number of components rang- 
ing from 6 to 15 and used the BIC for model selection. The optimal number of components is 
taken to be that which minimizes —2 log (Life) + (par ) logra, where par denotes the number of 
parameters in the model and Lik is the likelihood. We used the approximation of Lik from 
the output of EMMIX-WIRE for the computation of the BIC. See Ng et al. (2006) for details 
on how the likelihood was approximated. We ran EMMIX-WIRE again, this time fixing the 
number of components as 12. We also applied the VGA with Algorithm 1 once for each of the 
10 data sets. The adjusted Rand Index (ARI) (Hubert and Arabie, 1985) for the clustering 
of the fitted model relative to the true grouping of all 499 observations into 12 clusters was 
then computed in each case. The results are summarized in Table 1. From Table 1, the ARI 
attained by VGA was consistently higher than that attained by EMMIX-WIRE. It is also 
interesting to note that in almost all the ten sets of simulated data, the ARI attained by the 
12-component model fitted by EMMIX-WIRE was higher than that attained by the optimal 
model identified by BIC. So BIC tends to underestimate the number of components here, 
although the implications of this for applications in clustering algorithms may be less clear. 
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8 Conclusion 

We have proposed fitting MLMMs with variational methods and developed an efficient VGA 
which is able to perform parameter estimation and model selection simultaneously. This 
greedy approach handles initialization automatically and returns a plausible value for the 
number of mixture components. The experiments we have conducted showed that the VGA 
does not systematically underestimate nor overestimate the number of mixture components. 
For the simulated data sets considered, VGA was able to return mixture models where the 
number of mixture components is very close to the correct number of components. We further 
showed empirically that hierarchical centering can help to improve the rate of convergence in 
variational algorithms significantly. Some theoretical support was also provided for this ob- 
servation. Implementation of the VGA is straightforward as no further derivation is required 
once the basic variational algorithms are available. This greedy approach is not limited to 
MLMMs and could potentially be extended to fitting other models using variational meth- 
ods. All code was written in the R language and run on a dual processor Window PC 3GHz 
workstation. 

9 Supplementary materials 

The derivation of the variational lower bound in (3) and the expressions of the variational 
lower bounds and parameter updates for Algorithms 2 and 3 can be found in the supple- 
mentary materials. An example on application of Algorithm 2 to yeast galactose data is also 
included. 
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Derivation of variational lower bound 

The variational lower bound can be written as E q (\ogp(y,9)) — E q (logq(9)), where E q (-) 
denotes the expectation with respect to q. Consider the first term, E q (\ogp(y,9)). Let Qj = 
I(zi = j) where /(•) denotes the indicator function. We have 

n k 

logp(y, 9) = Yl & \}°^P(Vi\ z i = 3, Pj, a h bj, Ey) + log p(a,i\ol.) + logp^j + logp(5) 
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Taking expectations with respect to q, we obtain 
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where T(-) and ip(-) denote the gamma and digamma functions respectively, pij is evaluated 
by setting 5 = fi q s , p([i q 5 ) denotes the prior distribution for 5 evaluated at fi q , ^ = yi — Xiji^ — 

WifMl - V^l, S^- 1 = blockdiag J/,,.,) and A,, = X^Xf + W^Wj + 

ViY<ly t T . Turning to the second term, E q (\ogq(9)), we have 
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and putting (1) and (2) together gives the lower bound. 

Algorithm 2 (partial centering when X{ = W{) 



(Updates in steps 5, 7, 9, 10, 11 and 13 remain the same as in Algorithm 1 with s\ = p.) 

a%- a b- a q 

Initialize: for % = l,...,n, j = 1, /x^., -^r and -^r for j = 1, for 

3 3 a j bj jl 

j = 1, I = l,...,g. Do until the change in the lower bound between iterations is less 
than a tolerance: 
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The variational lower bound is given by 
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where Wy = Hi — — Ka^.- In the examples, when Algorithm 2 is used in conjunction 

with the VGA 

for I = l,...,g, 



with the VGA to fit a 1-component mixture (j = 1), we set — 1 for % — 1, ...,n, ^ 
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w 2 - = 1, -^r = 0.1, /i^. = and /A. = for initialization . 
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Algorithm 3 (full centering when Xi = W{ = Vi) 



(Updates in steps 7, 9, 11 and 13 remain the same as in algorithm 1 with s\ = S2 = p.) 

a q a . a l. a q 

Initialize: qij for % — 1, n, j = 1, k, /^., /il yj 2 - and w 2 - for j = 1, k, for j = 1, k, 

J 3 dj frj ji 

I = 1, g. Do until the change in the lower bound between iterations is less than a tolerance: 
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The variational lower bound is given by 
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In the examples, when Algorithm 3 is used in conjunction with the VGA to fit a 1-component 
mixture (j = 1), we set — 1 for % — 1, ...,n, = 10 for Z = 1, ...,<?, vj 1 = 0.1, = 0.01, 

jl a j bj 

— and = for initialization. We note that the rate of convergence of Algorithm 3 
can be sensitive to the initialization of -rP- and -yf and observed that an initialization 

jl a j bj 

cel. «2. a q 

satisfying t^ 2 - < -zP- < works better. 

% A 3 i 

Example on clustering of yeast galactose data 

The yeast galactose data of Ideker et al. (2001) has four replicate hybridizations for each of 20 
cDNA array experiments. We consider a subset of 205 genes previously analyzed by Yeung et 
al. (2003) and Ng et al. (2006) whose expression patterns reflect four functional categories in 
the Gene Ontology (GO) listings (Ashburner et al., 2000). This dataset is available from the 
online version of Yeung et al. (2003). Approximately 8% of the data are missing and Yeung 
et al. (2003) used a /c-nearest neighbour method to impute the missing data values. Yeung et 
al. (2003) and Ng et al. (2006) evaluated the performance of their clustering algorithms by 
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how closely the clusters compared with the four categories in the GO listings. They used the 
adjusted Rand index (Hubert and Arabie, 1985) to assess the degree of agreement between 
their partitions and the four functional categories. 

We use this example to illustrate the way that our model can make use of covariates in 
the mixing weights, unlike previous analyses of this data set. In particular, we use the GO 
listings as covariates in the mixture weights. Let Ui be a vector of length d — 4 where the /th 
element is 1 if the functional category of gene i is / and otherwise. Instead of looking at 
the data with the missing values imputed by the /c-nearest neighbour method, we consider 
the original data containing 8% missing values, since our model has the capability to handle 
missing data. Taking n = 205 genes, let y itr denote the rth repetition of the expression profile 
for gene i at experiment t, < r < 4, and Ru denote the number of replicate hybridizations 
data available for gene % in experiment t, i — 1, 205, t — 1, 20. For each i — 1, n, yi is 
a vector of n { observations where = Y?t=\ R u and Vi = (Vm, —,ynA, -,J/*,2o,i, - , 2/i,2o,4) T , 
with missing observations omitted. Vi is a x 80 matrix obtained from 7 80 by removing the 
(tr)th row if the observation for experiment t at the rth repetition is not available. Xj is a 
Hi x 20 matrix, 

^ Iru Ri1 ... Ril \ 
Ri2 l Ri2 ... Ri2 

A i — . . 

\0Ri2o Rl20 ... l Rt20 J 

and Wi = Xj. For the error terms, we set g = 20 with n a = R a , i = l,...,n, / = l,...,g, 
so that the error variance of each mixture component is allowed to vary between different 
experiments. We used the following priors, 5 ~ iV(0, 10007), /3j ~ iV(0, 10007) for j = 1, k, 
and 7G(0.01,0.01) for a 2 a ., <r$., j = and o) x , j = l,...,k, I = 1, 



Applying the VGA using Algorithm 2 five times, we obtained a 8-component mixture 
on all five trials with similar results. The clustering of a 8-component mixture with the 
highest estimated log marginal likelihood among the five trials is shown in Figure 1. Our 
clustering results are slightly different from Ng et al (2006) where the number of optimal 
clusters obtained was 7. We note that genes from Category 1 were split into 3 clusters by the 
VGA on all 5 attempts while in Ng et al (2006), genes from Category 1 were subdivided into 
2 clusters. In addition, instead of having one cluster containing all the genes from Category 
4, we observed that two or three of the genes in Category 4 were consistently separated from 
the cluster containing the remaining genes from Category 4. Fitted probabilities from the 
gating function are shown in Figure 2. These were obtained by substituting 5 with ji q & from 
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cluster 5 (37 genes) 



cluster 1 (12 genes) 



cluster 2 (10 genes) 



cluster 3 (36 genes) 




Figure 1: Clustering results for yeast galactose data from applying the VGA using Algorithm 
2. The x-axis are the experiments and y-axis are the gene expression profiles. GO listings 
were used as covariates in the mixture weights. 



oo 
d 



d 



o 
d 



L 




■ 


Cat 1 


□ 


Cat 2 


□ 


Cat 3 


□ 


Cat 4 



5 6 



Figure 2: Fitted probabilities from gating function. The x-axis are the clusters and y-axis 
are the probabilities. 



7 



the variational posterior into P(zi = j) = pij = ^^—^r^j which represents the probability 
that observation i belongs to component j of the mixture conditional on the category that 
observation i belongs to in the GO listings. 

To investigate the impact of reparametrizing the model using hierarchical centering, we 
applied VGA using Algorithm 1 five times. This time, we obtained a 8-component mix- 
ture four times and a 9-component mixture once. The average estimated log marginal log 
likelihood attained by Algorithm 1 was 8491, lower than the average of 8999 attained by 
Algorithm 2. For fitting a 8-component model, VGA with Algorithm 1 took an average of 
8455 seconds, while Algorithm 2 took an average of 3185 seconds. While these results may 
not be conclusive, the gain in efficiency in using Algorithm 2 over Algorithm 1 is clear. By 
using hierarchical centering, the computation time was reduced by more than half in this 
example. 
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